Using Regular Expressions in the User Dictionary

The regular expression syntax used by the Recognition User Dictionary differs slightly from the regular expression syntax used by the Autoredact API. A few specific differences are listed in Comparison with Other Implementations of Regular Expressions.

When an item in the User Dictionary is a regular expression, each string that a recognition module passes for UD-checking is tested to see whether it conforms to the pattern defined by that regular expression.

In general, UD-checking a string means that the items of the User Dictionary (UD items) are compared to the string one after another, until either an item is found that is identical to the string or that the string conforms to, or the last item of the dictionary is reached. If the end of the dictionary is reached without a match, the string is marked as non-compliant.
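
The checking loop described above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: `UDItem` and `ud_check` are hypothetical names, Python's `re` module stands in for the User Dictionary's own regular expression syntax (which differs in places), and the requirement that the whole string must conform to the pattern is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class UDItem:
    """Hypothetical stand-in for one User Dictionary entry."""
    text: str
    is_regex: bool = False


def ud_check(candidate: str, items: list) -> bool:
    """Return True as soon as one item accepts the candidate string."""
    for item in items:
        if item.is_regex:
            # Assumption: the whole string must conform to the pattern.
            if re.fullmatch(item.text, candidate):
                return True
        elif item.text == candidate:
            # A plain word-list item must be identical to the string.
            return True
    # The end of the dictionary was reached without a match: non-compliant.
    return False


items = [
    UDItem("Austria"),
    UDItem("Hungary"),
    UDItem(r"[0-9]{4}", is_regex=True),
]

print(ud_check("Hungary", items))  # True (identical item)
print(ud_check("1234", items))     # True (conforms to the pattern)
print(ud_check("Hungray", items))  # False (no item accepts it)
```

The loop stops at the first accepting item, so the order of items affects only how quickly a compliant string is accepted, not the verdict.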

The more information a character recognition program has about its target, the more accurate the recognition can be. There are cases where it is theoretically impossible to identify a character from the pixels of its image alone. Well-known examples in certain typefaces are 0 (zero) and O, or I (capital i), l (lowercase L) and 1 (digit one). In practice, where characters are imperfect, broken, or touching each other, the number of such cases increases rapidly.

In everyday text it is possible to make decisions based on the context (e.g., it is rare to have a capital letter in a word after a lowercase one) or by using dictionaries. But in form processing and automatic data-entry applications, characters are usually processed field by field. Sometimes it is possible to use a custom dictionary (word list) to help the OCR process, as in the case of country names, states, or cities (see Improving accuracy with the checking module in the Tutorial for more information on user-defined dictionaries). However, for the recognition of numerals, artificial identifiers, dates, addresses, etc., the structure of the data must be described, because enumerating all possibilities is not practical.

A regular expression (also called a mask or a pattern) describes a field by specifying each of its characters as being a member of a class or a set of characters.
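
The idea of describing a field position by position can be sketched as follows. This is a simplified illustration only: the mask is modeled as a plain list of character sets, and `conforms` and `resolve` are hypothetical helpers, not part of any recognition API. The `resolve` function also assumes that the recognizer offers several alternative characters per position, ordered by preference, which is an assumption made for this sketch.

```python
import string

DIGITS = set(string.digits)
CAPITALS = set(string.ascii_uppercase)

# One set of allowed characters per position: a capital, then three digits.
mask = [CAPITALS, DIGITS, DIGITS, DIGITS]


def conforms(field, mask):
    """True if every character belongs to the set given for its position."""
    return len(field) == len(mask) and all(
        ch in allowed for ch, allowed in zip(field, mask)
    )


def resolve(alternatives, mask):
    """Pick, for each position, the first alternative the mask allows.

    `alternatives` holds one string per position with the candidate
    characters in order of preference. Returns None if some position
    has no allowed candidate.
    """
    if len(alternatives) != len(mask):
        return None
    chosen = []
    for options, allowed in zip(alternatives, mask):
        pick = next((ch for ch in options if ch in allowed), None)
        if pick is None:
            return None
        chosen.append(pick)
    return "".join(chosen)


print(conforms("A123", mask))  # True
print(conforms("A12B", mask))  # False

# Look-alike shapes are settled by the mask, not by the pixels.
print(resolve(["I1l", "O0", "5S", "8B"], mask))  # I058
```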

For example, suppose you are scanning diskette labels to recognize the serial numbers printed on them, and the serial numbers are in the following format:

S/N RN40I123456

The serial numbers start with a fixed string of characters and a space ("S/N "), followed by two capital letters, two digits, another capital letter, and then six digits. Though the "S/N" prefix itself is not part of the data we want to capture, including it in the mask increases confidence in the recognition result. We can describe the zone with the following regular expression:

S/N *[A-Z][A-Z][0-9][0-9][A-Z][0-9]{6}

The expression starts by explicitly defining the fixed S/N characters; then comes a space followed by a multiplier, the asterisk. Since the serial number itself can be somewhat closer to or farther from the S/N characters, the OCR may find two or more spaces between them. The asterisk therefore states that any number of spaces can appear there, including no space at all.

Next come the first two capitals of the individual serial number. Here we use two identical character classes, or sets, enclosed in brackets. Within the brackets we can explicitly enumerate the characters the set contains, as in [RSTFC], or use ranges, as in the example above. Ranges are inclusive, that is, both of their limit characters are included in the set. It is also possible to mix these two methods, defining some characters explicitly and others by ranges, as in [RSTFC0-9], which defines a character to be any of the capitals listed or any digit.

In the above example we use similar character classes to define the other elements of the serial number. The last construct, [0-9]{6}, uses another kind of multiplier as shorthand for repeating the same set six times.
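
The constructs used in this mask (literal characters, the asterisk, bracketed sets with ranges, and the {6} multiplier) happen to have the same meaning in Python's `re` module, so the mask can be tried out there. The sketch below is only a way to experiment with the pattern: Python stands in for the User Dictionary engine, whose syntax differs in other respects, and `matches_mask` is a hypothetical helper that requires the whole string to conform.

```python
import re

# The mask from the text, compiled as a Python pattern.
MASK = re.compile(r"S/N *[A-Z][A-Z][0-9][0-9][A-Z][0-9]{6}")

# The same mask with the {6} multiplier written out by hand.
EXPANDED = re.compile(r"S/N *[A-Z][A-Z][0-9][0-9][A-Z]" + "[0-9]" * 6)


def matches_mask(text):
    """True if the whole string conforms to the serial number mask."""
    return MASK.fullmatch(text) is not None


samples = [
    "S/N RN40I123456",    # the format shown above
    "S/NRN40I123456",     # no space after S/N: allowed by the asterisk
    "S/N   RN40I123456",  # several spaces: also allowed
    "S/N RN4OI123456",    # letter O where a digit is required
    "S/N RN40I12345",     # only five trailing digits
]

for text in samples:
    print(repr(text), matches_mask(text))

# A mixed set: some characters listed explicitly, others given as a range.
mixed = re.compile(r"[RSTFC0-9]")
print([ch for ch in "T7A" if mixed.fullmatch(ch)])  # ['T', '7']
```

The first three samples conform and the last two do not. The fourth sample is the interesting one: a letter O read in place of a zero is rejected because that position accepts only the set [0-9].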